Abstract: Matrix completion algorithms recover a low rank matrix from a small 
fraction of the entries, each entry contaminated with additive errors. In practice, 
the singular vectors and singular values of the low rank matrix play a pivotal role 
for statistical analyses and inferences. This paper proposes estimators of these 
quantities and studies their asymptotic behavior. Under the setting where the
dimensions of the matrix increase to infinity and the probability of observing each 
entry is identical, Theorem 1 gives the rate of convergence for the estimated
singular vectors; Theorem 3 gives a multivariate central limit theorem for the estimated 
singular values. Even though the estimators use only a partially observed matrix, 
they achieve the same rates of convergence as the fully observed case. These
estimators combine to form a consistent estimator of the full low rank matrix that is 
computed with a non-iterative algorithm. In the cases studied in this paper, this 
estimator achieves the minimax lower bound in Koltchinskii et al. (2011a). The 
numerical experiments corroborate our theoretical results. 


Key words and phrases: Matrix completion, low rank matrices, singular value decomposition.

1 Introduction 

1 This research is supported by NSF grant DMS-1309998 and ARO grant W911NF-15-1-0423.

The matrix completion problem arises in several different machine learning and
engineering applications, ranging from collaborative filtering (Rennie and Srebro
(2005)), to computer vision (Weinberger and Saul (2006)), to positioning
(Montanari and Oh (2010)), and to recommender systems (Bennett and Lanning (2007)).
The literature has established a sizable body of algorithmic research (Rennie and
Srebro (2005); Keshavan et al. (2009); Cai et al. (2010); Mazumder et al. (2010);
Hastie et al. (2014); Cho et al. (2015)) and theoretical results (Fazel (2002);
Srebro et al. (2004); Candes and Recht (2009); Candes and Plan (2010); Keshavan
et al. (2010); Recht (2011); Gross (2011); Negahban et al. (2011); Koltchinskii
et al. (2011a); Rohde et al. (2011); Koltchinskii et al. (2011b); Candes and Plan
(2011); Negahban and Wainwright (2012); Cai and Zhou (2013); Davenport et al.
(2014); Chatterjee (2014)). This extant literature is primarily focused on
estimating the unobserved entries of the matrix. In several of these previous estimation 
techniques, the algorithms first estimate the singular vectors and singular values 
of the low rank matrix. Also, based upon classical multivariate statistics, these 
singular vectors and singular values can serve various types of statistical
analyses and inferences. For example, the overarching aim in the Netflix problem was 
to predict the unobserved film ratings, and the previous algorithms and theories 
served this purpose. However, if one wishes to interpret the resulting model 
predictions, then the estimated singular vectors and singular values can provide 
insights on (i) the main latent factors of film preferences and (ii) their relative 
strengths, respectively. In the Netflix example, 


“The first factor has on one side lowbrow comedies and horror movies, 
aimed at a male or adolescent audience (Half Baked, Freddy vs. Jason),
while the other side contains drama or comedy with serious 
undertones and strong female leads (Sophie’s Choice, Moonstruck). 
The second factor has independent, critically acclaimed, quirky films 
(Punch-Drunk Love, I Heart Huckabees) on one side, and mainstream 
formulaic films (Armageddon, Runaway Bride) on the other side.” 
(Koren et al. (2009)) 


This inference is based upon the leading singular vectors of the estimated matrix. 
To the best of our knowledge, no previous research has studied the statistical 
properties of the estimated singular vectors and singular values. 

This paper proposes estimators of the singular vectors and singular values of 
the low rank matrix as well as an estimator of the low rank matrix itself. First, 
Lemma 1 studies the singular vectors and singular values of a partially observed 
matrix that simply substitutes zeros for the unobserved entries; the resulting 
estimators are biased. The proposed estimators adjust for this bias. Theorem 1 
finds the convergence rate for the bias-adjusted singular vector estimators and 
Theorem 3 gives a multivariate central limit theorem for the bias-adjusted
singular value estimators. Despite the fact that the proposed estimators are built 
upon a partially observed matrix, they converge at the same rate as the
standard estimators built from a fully observed matrix, up to a constant factor which 
depends on the probability of observing each entry. Combining the proposed 
singular vector and value estimators, Section 4.2 gives a one-step consistent
estimator of the low rank matrix which does not iterate over several singular value 
decompositions or eigenvalue decompositions. The mean squared error of this 
estimator achieves the minimax lower bound in Theorems 5-7 of Koltchinskii et al. 
(2011a). 

The rest of this paper is organized as follows. Section 2 describes the model 
setup. Section 3 shows that the singular vectors and singular values of a partially 
observed matrix are biased and suggests a bias-adjusted alternative. Section 
4.1 finds (1) the convergence rates of the estimated singular vectors and (2) the 
asymptotic distribution of the estimated singular values. Section 4.2 proposes and 
studies a one-step consistent estimator of the full matrix. Section 5 corroborates 
the theoretical findings with numerical experiments. Finally, Section 6 provides 
the proofs of our main theoretical results. The proofs of the other results are 
collected in the Appendix. 


2 Model setup 

The underlying matrix that we wish to estimate is an $n \times d$ matrix $M_0$ with rank 
$r$. By singular value decomposition (SVD), 

$$M_0 = U \Lambda V^T, \tag{1}$$

for orthonormal matrices $U = (U_1, \ldots, U_r) \in \mathbb{R}^{n \times r}$ and
$V = (V_1, \ldots, V_r) \in \mathbb{R}^{d \times r}$ containing the left and right singular
vectors, and a diagonal matrix $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_r) \in \mathbb{R}^{r \times r}$
containing the singular values. $M_0$ is corrupted by noise
$\epsilon \in \mathbb{R}^{n \times d}$, where the entries of $\epsilon$ are i.i.d. sub-Gaussian random
variables with mean zero and variance $\sigma^2$. Let $y \in \{0,1\}^{n \times d}$ be such that
$y_{kh} = 1$ if the $(k,h)$-th entry of $M_0 + \epsilon$ is observed and $y_{kh} = 0$ otherwise.
The entries of $y$ are i.i.d. Bernoulli($p$) and independent of the entries of $\epsilon$.
Thus, the total number of observed entries in $M_0 + \epsilon$ is a Binomial($nd$, $p$)
random variable. We observe $y$ and the partially observed matrix
$M \in \mathbb{R}^{n \times d}$, where 

$$M_{kh} = [y \cdot (M_0 + \epsilon)]_{kh} =
\begin{cases}
M_{0kh} + \epsilon_{kh} & \text{if observed } (y_{kh} = 1), \\
0 & \text{otherwise } (y_{kh} = 0),
\end{cases}$$

for $1 \le k \le n$ and $1 \le h \le d$. Throughout the paper, it is presumed that 
$r \ll d \le n$. Moreover, the entries of $M_0$ are bounded in absolute value by a 
constant $L > 0$. 


Remark 1. In some applications the noise $\epsilon$ is tied to the measurement 
system, so that assuming errors exist for unobserved entries does not 
make sense. In that case, one can assume the following hierarchical model: 

$$\epsilon_{ij} \mid y_{ij} = 0 \;\; = 0 \text{ a.s.}, \qquad
\epsilon_{ij} \mid y_{ij} = 1 \;\; \text{sub-Gaussian}, \qquad
y_{ij} \sim \text{i.i.d. Bernoulli}(p).$$

In this setting, the results obtained in this paper would still hold, although the 
proofs may require additional techniques or minor changes. For simplicity, 
we focus only on the original setting. 


3 Estimation of singular values and vectors of $M_0$ 

The vast majority of previous estimators of $M_0$ have been initialized with $M$, in 
effect imputing the missing values with zero. In this section, we study the
properties of the singular vectors and values of $M$. This suggests alternative estimators 
of the singular vectors and values of $M_0$. 

3.1 Properties of singular values and vectors of M 

Define 

$$\hat\Sigma := M^T M \quad \text{and} \quad \hat\Sigma_t := M M^T.$$

Then, the eigenvectors of $\hat\Sigma$ and $\hat\Sigma_t$ are the same as the right and left singular 
vectors of $M$, respectively, and the square roots of the eigenvalues of $\hat\Sigma$ are the same 
as the singular values of $M$. The following lemma shows that $\hat\Sigma$ and $\hat\Sigma_t$ are biased 
estimators of $M_0^T M_0$ and $M_0 M_0^T$, respectively. 





Lemma 1. Under the model setup in Section 2, we have 

$$\mathbb{E}\hat\Sigma = p^2 M_0^T M_0 + p(1-p)\,\mathrm{diag}(M_0^T M_0) + np\sigma^2 I_d \tag{2}$$

and similarly, 

$$\mathbb{E}\hat\Sigma_t = p^2 M_0 M_0^T + p(1-p)\,\mathrm{diag}(M_0 M_0^T) + dp\sigma^2 I_n, \tag{3}$$

where $I_d$ and $I_n$ are $d \times d$ and $n \times n$ identity matrices, respectively. 


The proof of this lemma is in Appendix A.1. The right-hand side of (2) 
contains terms beyond $p^2 M_0^T M_0$, and they make the singular vectors and
singular values of $M$ biased estimators of the singular vectors and values of $M_0$. 
While the bias coming from $np\sigma^2 I_d$ is manageable (see footnote 2), the bias coming from
$p(1-p)\,\mathrm{diag}(M_0^T M_0)$ is not. The same applies to $\hat\Sigma_t$ in (3). 

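To get rid of the terms producing unmanageable biases, we define $\hat\Sigma_p$ and 
$\hat\Sigma_{pt}$ and their eigenvectors and eigenvalues as follows, 

$$\hat\Sigma_p := \hat\Sigma - (1-p)\,\mathrm{diag}(\hat\Sigma)
= (\hat V_p, \hat V_{pc})\,\mathrm{diag}(\hat\lambda_{p1}^2, \ldots, \hat\lambda_{pd}^2)\,(\hat V_p, \hat V_{pc})^T, \text{ and}$$

$$\hat\Sigma_{pt} := \hat\Sigma_t - (1-p)\,\mathrm{diag}(\hat\Sigma_t)
= (\hat U_p, \hat U_{pc})\,\mathrm{diag}(\hat\lambda_{pt1}^2, \ldots, \hat\lambda_{ptn}^2)\,(\hat U_p, \hat U_{pc})^T,$$

where 

$$\hat V_p = (\hat V_{p1}, \ldots, \hat V_{pr}) \in \mathbb{R}^{d \times r}, \quad
\hat V_{pc} = (\hat V_{p,r+1}, \ldots, \hat V_{pd}) \in \mathbb{R}^{d \times (d-r)},$$

$$\hat U_p = (\hat U_{p1}, \ldots, \hat U_{pr}) \in \mathbb{R}^{n \times r}, \quad
\hat U_{pc} = (\hat U_{p,r+1}, \ldots, \hat U_{pn}) \in \mathbb{R}^{n \times (n-r)}.$$

The following proposition shows that $\hat\Sigma_p$ and $\hat\Sigma_{pt}$ adjust the bias. 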

2 This term does not change the singular vectors of $\mathbb{E}\hat\Sigma$; it merely increases each singular value by 
$np\sigma^2$. 

Proposition 1. Under the model setup in Section 2, we have by eigendecomposition, 

$$\mathbb{E}\hat\Sigma_p = p^2 M_0^T M_0 + np^2\sigma^2 I_d = (V, V_c)\,\Lambda_p^2\,(V, V_c)^T \text{ and}$$

$$\mathbb{E}\hat\Sigma_{pt} = p^2 M_0 M_0^T + dp^2\sigma^2 I_n = (U, U_c)\,\Lambda_{pt}^2\,(U, U_c)^T,$$

where $V$ and $U$ are as defined in (1), $V_c \in \mathbb{R}^{d \times (d-r)}$, $U_c \in \mathbb{R}^{n \times (n-r)}$, 

$$\Lambda_p^2 = \mathrm{diag}(\lambda_{p1}^2, \ldots, \lambda_{pd}^2)
= \mathrm{diag}\big(p^2[\lambda_1^2 + n\sigma^2], \ldots, p^2[\lambda_r^2 + n\sigma^2],\, p^2 n\sigma^2, \ldots, p^2 n\sigma^2\big) \in \mathbb{R}^{d \times d}, \text{ and}$$

$$\Lambda_{pt}^2 = \mathrm{diag}\big(p^2[\lambda_1^2 + d\sigma^2], \ldots, p^2[\lambda_r^2 + d\sigma^2],\, p^2 d\sigma^2, \ldots, p^2 d\sigma^2\big) \in \mathbb{R}^{n \times n}.$$

The proof of this proposition easily follows from Lemma 1 and (1). 
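
The step from Lemma 1 to the first identity of the proposition can be written out as follows. This is only a sketch: it takes (2) as given and uses linearity of expectation and of the $\mathrm{diag}(\cdot)$ operator; the argument for the $n \times n$ matrix is the same with $n$ and $d$ exchanged.

```latex
\begin{align*}
\mathrm{diag}\big(\mathbb{E}\hat\Sigma\big)
  &= p^2\,\mathrm{diag}(M_0^T M_0) + p(1-p)\,\mathrm{diag}(M_0^T M_0) + np\sigma^2 I_d
   = p\,\mathrm{diag}(M_0^T M_0) + np\sigma^2 I_d, \\
\mathbb{E}\hat\Sigma_p
  &= \mathbb{E}\hat\Sigma - (1-p)\,\mathrm{diag}\big(\mathbb{E}\hat\Sigma\big) \\
  &= p^2 M_0^T M_0 + p(1-p)\,\mathrm{diag}(M_0^T M_0) + np\sigma^2 I_d
     - p(1-p)\,\mathrm{diag}(M_0^T M_0) - (1-p)\,np\sigma^2 I_d \\
  &= p^2 M_0^T M_0 + np^2\sigma^2 I_d .
\end{align*}
```

Since $M_0^T M_0 = V \Lambda^2 V^T$ by (1), adding a multiple of the identity leaves the eigenvectors unchanged and shifts every eigenvalue by $np^2\sigma^2$, which gives the stated diagonal entries $p^2[\lambda_i^2 + n\sigma^2]$ for $i \le r$ and $p^2 n\sigma^2$ otherwise.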

Proposition 1 shows that the top $r$ eigenvectors of $\mathbb{E}\hat\Sigma_p$ and $\mathbb{E}\hat\Sigma_{pt}$ are the 
same as the right and left singular vectors of $M_0$, respectively. Also, the top 
$r$ eigenvalues of $\mathbb{E}\hat\Sigma_p$ are easily adjusted to match the singular values of $M_0$ as 
follows, 

$$\lambda_i^2 = \frac{1}{p^2}\,\lambda_{pi}^2 - n\sigma^2, \quad \text{for } i = 1, \ldots, r.$$


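3.2 Estimators of singular values and vectors of $M_0$ 

The results in Proposition 1 suggest plug-in estimators using the leading
eigenvectors and eigenvalues of $\hat\Sigma_p$ and the leading eigenvectors of $\hat\Sigma_{pt}$ as estimators of 
$V$, $\Lambda$, and $U$, respectively. However, since $p$ is an unknown parameter in practice, 
the proposed estimators use, instead of $p$, the proportion of observed entries in $M$, 
$\hat p$, which is defined as 

$$\hat p = \frac{\sum_{k=1}^{n} \sum_{h=1}^{d} y_{kh}}{nd}.$$

Using $\hat p$, define $\hat\Sigma_{\hat p}$ and $\hat\Sigma_{\hat p t}$ as 

$$\hat\Sigma_{\hat p} := \hat\Sigma - (1-\hat p)\,\mathrm{diag}(\hat\Sigma) \quad \text{and} \quad
\hat\Sigma_{\hat p t} := \hat\Sigma_t - (1-\hat p)\,\mathrm{diag}(\hat\Sigma_t). \tag{6}$$

By eigendecomposition, 

$$\hat\Sigma_{\hat p} = (\hat V, \hat V_c)\,\hat\Lambda_{\hat p}^2\,(\hat V, \hat V_c)^T \quad \text{and} \quad
\hat\Sigma_{\hat p t} = (\hat U, \hat U_c)\,\hat\Lambda_{\hat p t}^2\,(\hat U, \hat U_c)^T, \tag{7}$$

where $\hat V \in \mathbb{R}^{d \times r}$, $\hat V_c \in \mathbb{R}^{d \times (d-r)}$,
$\hat\Lambda_{\hat p}^2 = \mathrm{diag}(\hat\lambda_{\hat p 1}^2, \ldots, \hat\lambda_{\hat p d}^2) \in \mathbb{R}^{d \times d}$,
$\hat U \in \mathbb{R}^{n \times r}$, $\hat U_c \in \mathbb{R}^{n \times (n-r)}$, and
$\hat\Lambda_{\hat p t}^2 = \mathrm{diag}(\hat\lambda_{\hat p t 1}^2, \ldots, \hat\lambda_{\hat p t n}^2) \in \mathbb{R}^{n \times n}$.
Then, estimate the left and right singular vectors, $U$ and $V$, of $M_0$ by $\hat U$ and $\hat V$,
respectively. Also, estimate the singular values, $\lambda_i$, $i = 1, \ldots, r$, of $M_0$ by 

$$\hat\lambda_i = \frac{1}{\hat p}\sqrt{\hat\lambda_{\hat p i}^2 - \hat\tau_{\hat p}}, \tag{8}$$

where $\hat\tau_{\hat p} = \frac{1}{d-r}\,\mathrm{tr}\big(\hat V_c^T \hat\Sigma_{\hat p} \hat V_c\big)$. 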

For any $A \in \mathbb{R}^{n \times d}$, let the $i$-th left singular vector of $A$ be denoted by $u_i(A)$, 
the $i$-th right singular vector of $A$ by $v_i(A)$, and the $i$-th largest singular value of $A$ 
by $\lambda_i(A)$ for $i = 1, \ldots, d$. Then, Algorithm 1 summarizes the steps to compute 
the proposed estimators of the singular values and vectors of $M_0$. 

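Algorithm 1 Estimators of $U_i$, $V_i$, and $\lambda_i$ for $i = 1, \ldots, r$ 
Require: $M$, $y$, and $r$ 
  $\hat p \leftarrow \frac{1}{nd} \sum_{k=1}^{n} \sum_{h=1}^{d} y_{kh}$ 
  $\hat\Sigma_{\hat p} \leftarrow M^T M - (1 - \hat p)\,\mathrm{diag}(M^T M)$ 
  $\hat\Sigma_{\hat p t} \leftarrow M M^T - (1 - \hat p)\,\mathrm{diag}(M M^T)$ 
  $\hat V_i \leftarrow v_i(\hat\Sigma_{\hat p})$, $\forall i \in \{1, \ldots, r\}$ 
  $\hat U_i \leftarrow u_i(\hat\Sigma_{\hat p t})$, $\forall i \in \{1, \ldots, r\}$ 
  $\hat\tau_{\hat p} \leftarrow \frac{1}{d-r} \sum_{i=r+1}^{d} \lambda_i(\hat\Sigma_{\hat p})$ 
  $\hat\lambda_i \leftarrow \frac{1}{\hat p}\sqrt{\lambda_i(\hat\Sigma_{\hat p}) - \hat\tau_{\hat p}}$, $\forall i \in \{1, \ldots, r\}$ 
return $\hat V_i$, $\hat U_i$, and $\hat\lambda_i$ for $i = 1, \ldots, r$ 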
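
The steps of Algorithm 1 can be sketched in NumPy as below. This is an illustrative sketch rather than the authors' code: the function name `estimate_singular_structure` is hypothetical, the symmetric eigendecomposition `numpy.linalg.eigh` is used in place of a singular value decomposition of the two symmetric matrices, and the difference under the square root is clipped at zero as a numerical safeguard that is not part of the algorithm.

```python
import numpy as np

def estimate_singular_structure(M, y, r):
    """Sketch of Algorithm 1 (hypothetical helper name).

    M : (n, d) array, observed matrix with zeros at unobserved entries
    y : (n, d) 0/1 array, y[k, h] = 1 if entry (k, h) is observed
    r : assumed rank, with r < d
    """
    n, d = M.shape
    p_hat = y.sum() / (n * d)                      # proportion of observed entries

    S = M.T @ M                                    # d x d
    S_t = M @ M.T                                  # n x n
    # Bias adjustment: shrink the diagonals by the factor (1 - p_hat).
    S_p = S - (1.0 - p_hat) * np.diag(np.diag(S))
    S_pt = S_t - (1.0 - p_hat) * np.diag(np.diag(S_t))

    # eigh returns eigenvalues in ascending order; reverse to descending.
    evals, evecs = np.linalg.eigh(S_p)
    evals, evecs = evals[::-1], evecs[:, ::-1]
    _, evecs_t = np.linalg.eigh(S_pt)
    evecs_t = evecs_t[:, ::-1]

    V_hat = evecs[:, :r]                           # estimated right singular vectors
    U_hat = evecs_t[:, :r]                         # estimated left singular vectors

    tau_hat = evals[r:].mean()                     # mean of the d - r trailing eigenvalues
    # Clipping at zero is an added safeguard, not a step of Algorithm 1.
    lam_hat = np.sqrt(np.clip(evals[:r] - tau_hat, 0.0, None)) / p_hat
    return U_hat, V_hat, lam_hat, p_hat
```

With a fully observed, noiseless input (`y` all ones), `p_hat` equals 1 and the adjustment vanishes, so the function reduces to an ordinary eigendecomposition of $M^T M$ and $M M^T$.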


4 Asymptotic theory 

This section investigates the statistical properties of the estimators proposed in 
(7) and (8). 

4.1 Convergence rate of the estimated singular vectors and asymptotic
distribution of the estimated singular values 

Let $x = (x_1, \ldots, x_n)^T$ be an $n$-dimensional vector and $A = (A_{kh})$ an $n \times d$ matrix. 
Then, the $\ell_p$-norm is defined as follows, 

$$\|x\|_p = \Big(\sum_{i} |x_i|^p\Big)^{1/p} \quad \text{and} \quad
\|A\|_p = \sup\{\|Ax\|_p : \|x\|_p = 1\}, \quad p = 1, 2, \infty.$$

The spectral norm $\|A\|_2$ is the square root of the largest eigenvalue of $A A^T$, 

$$\|A\|_1 = \max_{1 \le h \le d} \sum_{k=1}^{n} |A_{kh}|, \quad \text{and} \quad
\|A\|_\infty = \max_{1 \le k \le n} \sum_{h=1}^{d} |A_{kh}|.$$








The squared Frobenius norm is defined by $\|A\|_F^2 = \mathrm{tr}(A^T A)$, the trace of $A^T A$. 
We denote by $c > 0$ and $C > 0$ generic constants that are free of $n$, $d$, and $p$, and 
that may differ from appearance to appearance. 

To measure how close the proposed estimator $\hat V$ is to $V$ (or $\hat U$ to $U$), we 
introduce a classical notion of distance between subspaces. Let $\mathcal{R}(Z_1)$ denote the 
column space spanned by $Z_1 \in \mathbb{R}^{d \times r}$ and $\mathcal{R}(Z_2)$ that spanned by $Z_2 \in \mathbb{R}^{d \times r}$. Then, to measure 
the dissimilarity between $\mathcal{R}(Z_1)$ and $\mathcal{R}(Z_2)$, consider the following loss function 

$$\|\sin(Z_1, Z_2)\|_F^2 = \|\sin\Theta(\mathcal{R}(Z_1), \mathcal{R}(Z_2))\|_F^2,$$

where $\sin\Theta(\mathcal{R}(Z_1), \mathcal{R}(Z_2))$ is a diagonal matrix of singular values (canonical 
angles) of $P_1 P_2^{\perp}$, with $P_1$ and $P_2$ the orthogonal projections onto $\mathcal{R}(Z_1)$ and $\mathcal{R}(Z_2)$, respectively. 
Here $P^{\perp} = I - P$. The canonical angles generalize the notion of angles between 
lines and are often used to define the distance between subspaces. If the columns 
of $Z_1$ and $Z_2$ are singular vectors, $\mathcal{R}(Z_1)$ and $\mathcal{R}(Z_2)$ have projections
$P_1 = Z_1 Z_1^T$ and $P_2 = Z_2 Z_2^T$, respectively, and
$\|\sin(Z_1, Z_2)\|_F^2 = \|Z_1 Z_1^T (Z_2 Z_2^T)^{\perp}\|_F^2 = \frac{1}{2}\|Z_1 Z_1^T - Z_2 Z_2^T\|_F^2$.
Proposition 2.2 in Vu and Lei (2013) relates this subspace 
distance to the Frobenius distance 

$$\frac{1}{2} \inf_{O \in \mathcal{V}_{r,r}} \|Z_1 - Z_2 O\|_F^2 \;\le\; \|\sin(Z_1, Z_2)\|_F^2 \;\le\; \inf_{O \in \mathcal{V}_{r,r}} \|Z_1 - Z_2 O\|_F^2, \tag{9}$$

where $\mathcal{V}_{r,r} = \{O \in \mathbb{R}^{r \times r} : O^T O = I_r \text{ and } O O^T = I_r\}$ denotes the Stiefel manifold 
of $r \times r$ orthonormal matrices. In other words, the distance between two subspaces 
corresponds to the minimal distance between their orthonormal bases. 
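
The subspace loss and the two-sided bound (9) can be checked numerically with a short sketch. The function names `sin_theta_sq` and `procrustes_sq` are hypothetical; both functions assume their inputs have orthonormal columns, and the infimum over orthonormal $O$ is computed with the standard orthogonal Procrustes solution obtained from an SVD.

```python
import numpy as np

def sin_theta_sq(Z1, Z2):
    """Squared Frobenius sin-theta distance between the column spaces of
    Z1 and Z2, computed as 0.5 * ||Z1 Z1^T - Z2 Z2^T||_F^2.
    Assumes both inputs have orthonormal columns."""
    P1 = Z1 @ Z1.T
    P2 = Z2 @ Z2.T
    return 0.5 * np.linalg.norm(P1 - P2, "fro") ** 2

def procrustes_sq(Z1, Z2):
    """inf over orthonormal O of ||Z1 - Z2 O||_F^2, attained at O = W Q^T
    where Z2^T Z1 = W diag(s) Q^T is an SVD (orthogonal Procrustes)."""
    W, _, Qt = np.linalg.svd(Z2.T @ Z1)
    O = W @ Qt
    return np.linalg.norm(Z1 - Z2 @ O, "fro") ** 2

# Small illustration with two random 3-dimensional subspaces of R^10.
rng = np.random.default_rng(1)
Z1, _ = np.linalg.qr(rng.normal(size=(10, 3)))
Z2, _ = np.linalg.qr(rng.normal(size=(10, 3)))
loss = sin_theta_sq(Z1, Z2)
bound = procrustes_sq(Z1, Z2)
print(0.5 * bound <= loss <= bound)
```

The loss is invariant to the choice of basis: replacing `Z2` by `Z2 @ O` for any orthonormal `O` leaves `sin_theta_sq` unchanged, which is why it is suited to comparing estimated singular vectors that are only identified up to sign or rotation.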


Assumption 1. 

(1) $\lambda_i = b_i\sqrt{nd}$, $i = 1, \ldots, r$, where $0 < b_i \le c$ for a constant $c > 0$; 

(2) there exists a constant $m \in \{1, \ldots, r\}$ such that $b_m > b_{m+1}$, where $b_{r+1} = 0$; 

(3) $d \le n \le e^{d^{\alpha}}$ for a constant $\alpha < 1$ free of $n$, $d$, and $p$. 

Remark 2. To motivate Assumption 1(1), suppose that a non-vanishing
proportion of entries of $M_0$ contains non-vanishing signals (i.e., $|M_{0kh}| > c_0$ for some 
constant $c_0 > 0$) and that the rank of $M_0$ is fixed. Then, 

$$\sum_{k=1}^{n} \sum_{h=1}^{d} M_{0kh}^2 = \|M_0\|_F^2 > c\,nd$$

for some constant $c > 0$. Because the squared Frobenius norm is also the sum of 
the squared singular values of $M_0$, the order of the singular values of $M_0$ should 
be $\sqrt{nd}$ (see also Fan et al. (2013)). Assumption 1(1) may seem uncommon in 
the matrix completion literature, but consider the widely-used assumption (II.2) 
in Candes and Plan (2010), 

$$\max_{1 \le k \le n} |U_{ik}| \le \sqrt{C/n} \quad \text{and} \quad \max_{1 \le h \le d} |V_{ih}| \le \sqrt{C/d}$$

for $i = 1, \ldots, r$ and a constant $C > 1$, which prevents spiky singular vectors. 
Under the model setup in Section 2 where the entries of $M_0$ are bounded in 
absolute value by a constant $L > 0$, this implies Assumption 1(1). 

The following theorem shows the convergence of $\hat V$ to $V$ and $\hat U$ to $U$. 

Theorem 1. Under the model setup in Section 2 and Assumption 1, let $V^{(m)}$ 
and $U^{(m)}$ be the first $m$ columns of $V$ and $U$ defined in (1), respectively, and let 
$\hat V^{(m)}$ and $\hat U^{(m)}$ be the first $m$ columns of $\hat V$ and $\hat U$ defined in (7), respectively. 
Then, for large $n$ and $d$, 

$$\mathbb{E}\,\big\|\sin\big(\hat V^{(m)}, V^{(m)}\big)\big\|_F^2 \le \frac{C_1 n^{-1}}{p\,(b_m^2 - b_{m+1}^2)^2} \tag{10}$$

and 

$$\mathbb{E}\,\big\|\sin\big(\hat U^{(m)}, U^{(m)}\big)\big\|_F^2 \le \frac{C_2 d^{-1}}{p\,(b_m^2 - b_{m+1}^2)^2}, \tag{11}$$

where $C_1$ and $C_2$ are generic constants free of $n$, $d$, and $p$. 

The proof of this theorem is in Section 6.1. 


Remark 3. As long as $pd/\log n \to \infty$, the convergence rates in Theorem 1 
will hold. Hence, under the setting where $p$ goes to zero, if $d/\log n$ diverges fast enough 
that $pd/\log n \to \infty$, we can still obtain the results in Theorem 1. 


Remark 4. Despite the fact that $\hat V^{(m)}$ is built on a partially observed matrix 
$M$, Theorem 1 gives the convergence rate $n^{-1/2}/(b_m^2 - b_{m+1}^2)$, which is the standard
convergence rate for eigenvectors (Anderson et al. (1958)). The effect of the partial 
observations appears in the denominator of the right-hand side of (10) as $p$. A 
similar discussion applies to $\hat U^{(m)}$ in (11). 


































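The next theorem shows the asymptotic distribution of $\sum_{i=1}^{m} \hat\lambda_i^2$ centered around 
$\sum_{i=1}^{m} \lambda_i^2$. 

Theorem 2. Suppose $nd^{-1} \to \infty$. Then, under the model setup in Section 2 
and Assumption 1, we have 

$$\frac{\sum_{i=1}^{m} \hat\lambda_i^2 - \sum_{i=1}^{m} \lambda_i^2}{\sqrt{nd}\,\sigma_\lambda} \to \mathcal{N}(0, 1) \text{ in distribution, as } n \text{ and } d \to \infty,$$

where 

$$\sigma_\lambda^2 = \frac{4(1-p)}{p}\left(\sum_{k=1}^{n} \sum_{h=1}^{d} M_{0kh}^2 \Big(\sum_{i=1}^{m} b_i U_{ik} V_{ih}\Big)^2 - \Big(\sum_{i=1}^{m} b_i^2\Big)^2\right) + \frac{4\sigma^2}{p} \sum_{i=1}^{m} b_i^2,$$

$U_{ik}$ is the $k$-th entry of $U_i$, and $V_{ih}$ is the $h$-th entry of $V_i$. 

The proof of this theorem is in Section 6.2. 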


Remark 5. As long as $pd/\log n \to \infty$ and $pnd^{-1} \to \infty$, the asymptotic normality 
result in Theorem 2 will hold. Hence, under the setting where $p$ goes to zero, if 
$d/\log n$ and $n/d$ diverge fast enough that $pd/\log n \to \infty$ and $pn/d \to \infty$, we can still 
obtain the results in Theorem 2. 


Remark 6. Theorem 2 shows that the convergence rate of $\sum_{i=1}^{m} \hat\lambda_i^2$ is $\sqrt{nd}$. 
Considering Assumption 1(1), it is an optimal rate. However, since the results 
are based on partially observed entries, the asymptotic variance, $\sigma_\lambda^2$, increases 
with the rate $p^{-1}$. For example, when we have a fully-observed matrix, $\sigma_\lambda^2$ simply 
becomes $4\sigma^2 \sum_{i=1}^{m} b_i^2$, which is a lower bound for $\sigma_\lambda^2$. 

One of the main purposes of this paper is to investigate asymptotic behaviors 
of the estimators of the singular values of $M_0$. An application of the proof of 
Theorem 2 and the delta method provides a multivariate central limit theorem 
for $\hat\lambda_1, \ldots, \hat\lambda_r$. 


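Theorem 3. Suppose that 

$$b_i > b_{i+1} \text{ for all } i \in \{1, \ldots, r\} \quad \text{and} \quad nd^{-1} \to \infty.$$

Then, under the model setup in Section 2 and Assumption 1, we have 

$$\Gamma^{-1/2} \begin{pmatrix} \hat\lambda_1 - \lambda_1 \\ \vdots \\ \hat\lambda_r - \lambda_r \end{pmatrix} \to \mathcal{N}(0, I_r) \text{ in distribution, as } n \text{ and } d \to \infty,$$

where $\Gamma = (\Gamma_{ij}) \in \mathbb{R}^{r \times r}$ consists of 

$$\Gamma_{ij} =
\begin{cases}
\dfrac{1-p}{p}\left(\sum_{k=1}^{n} \sum_{h=1}^{d} M_{0kh}^2 U_{ik}^2 V_{ih}^2 - b_i^2\right) + \dfrac{\sigma^2}{p} & \text{if } i = j, \\[2ex]
\dfrac{1-p}{p}\left(\sum_{k=1}^{n} \sum_{h=1}^{d} M_{0kh}^2 U_{ik} V_{ih} U_{jk} V_{jh} - b_i b_j\right) & \text{if } i \ne j.
\end{cases} \tag{12}$$

Thus, $|\hat\lambda_i - \lambda_i| = O_p(p^{-1/2})$. 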

Remark 7. As in the case of Theorem 2 (see Remark 5), as long as $pd/\log n \to \infty$ 
and $pnd^{-1} \to \infty$, the asymptotic normality result in Theorem 3 will hold. Note 
that Theorems 2 and 3 require an additional condition, $pnd^{-1} \to \infty$, on top of the 
condition required for Theorem 1, $pd/\log n \to \infty$. Under the setting where $p$ is 
a constant, this additional condition implies that $d/n$ has to go to zero. The 
rationale behind this is as follows. In Theorems 2 and 3, we find the limiting 
distribution of the singular values of $M_0$ from a $d \times d$ matrix $\hat\Sigma_{\hat p}$, while the total 
number of observations is $nd$. That is, the size of our parameter space is $d^2$ 
and the total amount of information we can use to find asymptotic properties of 
the parameters is $nd$. Since our observations are also noisy, we need a sufficient 
number of observations to achieve our goal. When $d/n \to 0$, we can make the 
approximation errors in the singular values of $\hat\Sigma_{\hat p}$ negligible and find the limiting 
distribution of the singular values of $M_0$. 


Remark 8. The results of Theorems 2 and 3 help us to make statistical inference 
on the singular values of $M_0$. For example, they open up possibilities for us to 
evaluate how many factors are significant or how influential each factor is, by 
providing the distribution of the singular values. 


Theorems 1-3 show that the proposed estimators for $U$, $V$, and the $\lambda_i$'s are
asymptotically unbiased and have optimal convergence rates. With these well-developed 
estimators for the singular values and vectors of $M_0$, the following section
proposes a consistent estimator of $M_0$. 


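4.2 A consistent estimator of $M_0$ 

Suppose that $b_i > b_{i+1}$ for all $i = 1, \ldots, r$. Theorem 1 and (9) imply that $\hat V_i$ and 
$\hat U_i$ can estimate $V_i$ and $U_i$ up to the constant factors $\mathrm{sign}(\langle \hat V_i, V_i \rangle)$ and $\mathrm{sign}(\langle \hat U_i, U_i \rangle)$, 
respectively. Let $s_0 = (s_{01}, \ldots, s_{0r}) \in \{-1, 1\}^r$ be 

$$s_{0i} = \mathrm{sign}(\langle \hat V_i, V_i \rangle)\,\mathrm{sign}(\langle \hat U_i, U_i \rangle) \quad \text{for } i \in \{1, \ldots, r\}. \tag{13}$$

Then, $\hat M(s_0) = \sum_{i=1}^{r} s_{0i} \hat\lambda_i \hat U_i \hat V_i^T$ becomes a consistent estimator of $M_0$. However, 
since $s_0$ is an unknown parameter in practice, we employ $\hat s = (\hat s_1, \ldots, \hat s_r) \in \{-1, 1\}^r$
as an estimator of $s_0$; 

$$\hat s = \arg\min_{s \in \{-1,1\}^r} \big\|P_\Omega(\hat M(s)) - P_\Omega(M)\big\|_F^2, \tag{14}$$

where $\Omega$ contains the indices of the observed entries, $y_{kh} = 1 \Leftrightarrow (k, h) \in \Omega$, and 
$P_\Omega(A)$ for any $A \in \mathbb{R}^{n \times d}$ denotes the projection of $A$ onto $\Omega$, 

$$P_\Omega(A)_{kh} =
\begin{cases}
A_{kh} & \text{if } (k, h) \in \Omega, \\
0 & \text{if } (k, h) \notin \Omega,
\end{cases}
\quad \text{for } 1 \le k \le n \text{ and } 1 \le h \le d.$$

Hence, the proposed estimator of $M_0$ is 

$$\hat M(\hat s) = \sum_{i=1}^{r} \hat s_i \hat\lambda_i \hat U_i \hat V_i^T. \tag{15}$$


Remark 9. Finding $\hat s$ as in (14) requires $2^r$ computations. Hence, it can be 
a computational bottleneck or even impossible for a large $r$. In such cases, we 
suggest an alternate way to find $\hat s$ as follows; 

$$\hat s_i^{\text{alternate}} = \mathrm{sign}(\langle \hat V_i, v_i(M) \rangle)\,\mathrm{sign}(\langle \hat U_i, u_i(M) \rangle) \quad \text{for } i = 1, \ldots, r.$$

Note that if we use $V_i$ and $U_i$ instead of $v_i(M)$ and $u_i(M)$, this gives us the true 
sign $s_0$ in (13). 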


In the following we show that $\hat M(\hat s)$ is a consistent estimator of $M_0$ under
certain conditions. The steps to compute $\hat M(\hat s)$ using $\{\hat V_i, \hat U_i, \hat\lambda_i\}_{i=1}^{r}$ from Algorithm 
1 are summarized in Algorithm 2. 


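Algorithm 2 Estimator of $M_0$ 
Require: $\hat V_i$, $\hat U_i$, and $\hat\lambda_i$ for $i = 1, \ldots, r$ 
  $\hat s \leftarrow \arg\min_{s \in \{-1,1\}^r} \big\|P_\Omega\big(\sum_{i=1}^{r} s_i \hat\lambda_i \hat U_i \hat V_i^T\big) - P_\Omega(M)\big\|_F^2$ 
  $\hat M(\hat s) \leftarrow \sum_{i=1}^{r} \hat s_i \hat\lambda_i \hat U_i \hat V_i^T$ 
return $\hat M(\hat s)$ 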
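
Algorithm 2 can be sketched as an exhaustive search over the $2^r$ sign vectors. The function name `complete_matrix` is hypothetical, and the sketch relies on the fact that $M$ is already zero outside $\Omega$, so that $P_\Omega(\hat M(s)) - P_\Omega(M)$ equals the mask $y$ multiplied entrywise by $\hat M(s) - M$. As Remark 9 notes, this search is only practical for small $r$.

```python
import itertools
import numpy as np

def complete_matrix(U_hat, V_hat, lam_hat, M, y):
    """Sketch of Algorithm 2 (hypothetical helper name).

    U_hat : (n, r), V_hat : (d, r), lam_hat : (r,) estimators from Algorithm 1
    M     : (n, d) observed matrix with zeros at unobserved entries
    y     : (n, d) 0/1 observation mask
    Returns the estimated matrix and the selected sign vector.
    """
    r = len(lam_hat)
    best_s, best_loss = None, np.inf
    # Enumerate all 2^r sign vectors, as in (14).
    for s in itertools.product((-1.0, 1.0), repeat=r):
        s = np.asarray(s)
        M_s = (U_hat * (s * lam_hat)) @ V_hat.T     # sum_i s_i lam_i U_i V_i^T
        loss = np.sum((y * (M_s - M)) ** 2)         # squared error on observed entries
        if loss < best_loss:
            best_s, best_loss = s, loss
    M_hat = (U_hat * (best_s * lam_hat)) @ V_hat.T
    return M_hat, best_s
```

The broadcast `U_hat * (s * lam_hat)` scales the $i$-th column of `U_hat` by $s_i \hat\lambda_i$, which avoids forming an explicit diagonal matrix.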


Assumption 2. 

(1) $\lim_{n \to \infty, d \to \infty} P\Big(\min_{s \in \{-1,1\}^r} \big\|P_\Omega(\hat M(s)) - P_\Omega(M)\big\|_F^2 < \big\|P_\Omega(\hat M(s_0)) - P_\Omega(M)\big\|_F^2\Big) = 0$; 

(2) $b_i > b_{i+1}$ for all $i = 1, \ldots, r$. 


Remark 10. When the rank $r$ is 1, it is more straightforward to understand 
Assumption 2(1). Assuming that $s_0 = 1$, it means that 

$$\lim_{n \to \infty, d \to \infty} P\Big(\big\|P_\Omega(-\hat\lambda \hat U \hat V^T) - P_\Omega(M)\big\|_F^2 < \big\|P_\Omega(\hat\lambda \hat U \hat V^T) - P_\Omega(M)\big\|_F^2\Big) = 0.$$

That is, the probability that $\hat s$ picks a different sign than the true sign $s_0 = 1$ 
goes to zero with the dimensionality. Given the asymptotic properties of our 
estimators $\hat\lambda$, $\hat U$, and $\hat V$, this is not an unreasonable assumption to make. 


Theorem 4. Under the model setup in Section 2 and Assumptions 1-2, for any 
given $\eta > 0$, there exists a constant $C_\eta > 0$ such that for sufficiently large $n$, 

$$P\left(\frac{p}{n}\,\big\|\hat M(\hat s) - M_0\big\|_F^2 \ge C_\eta\right) \le \eta.$$

Or alternatively, 

$$\big\|\hat M(\hat s) - M_0\big\|_F^2 = \frac{n}{p}\,o_p(h_n),$$

where $h_n$ can be anything that diverges very slowly with the dimensionality, for 
example, $\log(\log d)$. 


The proof of this theorem is in Section 6.3. 


Remark 11. As in the case of Theorem 1 (see Remark 3), as long as $pd/\log n \to \infty$, the 
convergence rates in Theorem 4 will hold. If we let $p = N/(nd)$ so that $N$ represents 
the number of observed entries in the population sense, this condition implies 
that $N/(n \log n) \to \infty$. Therefore, for $\hat M(\hat s)$ to be consistent, the number of observed 
entries should increase at a faster rate than $n \log n$. This is a comparable result 
to Theorem 1 in Candes and Plan. 


Remark 12. The additional condition, $pnd^{-1} \to \infty$, required for Theorems 2 
and 3 (see Remarks 5 and 7), is not needed for Theorems 1 and 4. It means that 
if $p$ is a constant, even though $d/n \to c$ for some $0 < c \le 1$ or $d \le n$, the results 
in Theorems 1 and 4 will still hold, but the results in Theorems 2 and 3 will not. 




















Remark 13. Theorem 4 shows that $\frac{1}{nd}\|\hat M(\hat s) - M_0\|_F^2$ is bounded by $C p^{-1} d^{-1}$ 
for some constant $C > 0$. Under the setting where the rank of $M_0$ is fixed as 
in this paper, this matches the minimax lower bound in Theorems 5-7 of
Koltchinskii et al. (2011a). The previous estimators that obtain the minimax 
rate are computed via semidefinite programs that require iterating over several 
SVDs. However, the proposed estimator is computed by a non-iterative algorithm. 


Remark 14. Chatterjee (2014) established the minimax error rate for estimators 
of a general class of noisy incomplete matrices which extends beyond low rank 
matrix completion. In the regime studied herein, the convergence rate of our 
estimator of $M_0$ is faster than the convergence rate in Theorem 2.1 of Chatterjee 
(2014). This is likely because we consider a smaller class of matrices, where the 
singular values of a low rank matrix have the divergence rate $\sqrt{nd}$ (Assumption 
1(1)). Remark 2 justifies this assumption in the setting of low rank matrix 
completion. 


Throughout this paper, we have assumed that the rank, $r$, of $M_0$ is known. 
However, it is an unknown parameter and needs to be estimated. The following 
lemma proposes an estimator of $r$ and shows its consistency. 

Lemma 2. Let $c_d > 0$ be such that $c_d/d \to 0$ and $c_d \to \infty$, for example, $c_d = c \log d$
for any $c > 0$. Also, let $\hat r = |\{i \in \{1, \ldots, d\} : \hat\lambda_{\hat p i}^2 > \hat p^2 n c_d\}|$ where $\hat\lambda_{\hat p i}^2$ is 
defined in (7). Then, for any given $\delta > 0$, we have 

$$P(\hat r = r) = 1 - O(n^{-\delta}).$$


The proof of this lemma is in Appendix A.5. 

Remark 15. Empirically, to find $c_d$ and $\hat r$ in Lemma 2, we suggest using a scree 
plot of the singular values of $\hat\Sigma_{\hat p}$ in (6). 

Remark 16. As long as $c_d$ satisfies $\sigma^2 p^2 n < p^2 n c_d < (\sigma^2 + b_r^2 d)\,p^2 n$, consistency 
of $\hat r$ in Lemma 2 will hold. However, in the finite sample case, if the noise level $\sigma^2$ 
is larger than $b_r^2 d$, it can be difficult to observe a singular-value gap and determine 
$r$ using the scree plot of the singular values of $\hat\Sigma_{\hat p}$. 
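
The thresholding rule of Lemma 2 can be sketched as follows. The function name `estimate_rank` and its argument `c` (the constant in $c_d = c \log d$) are hypothetical, and the observed proportion $\hat p$ is assumed in the threshold. Lemma 2 is an asymptotic statement, so in finite samples the count can be sensitive to the choice of `c`; the scree plot suggested in Remark 15 remains the practical guide.

```python
import numpy as np

def estimate_rank(M, y, c=1.0):
    """Sketch of the rank estimator in Lemma 2 (hypothetical helper name).

    Counts the eigenvalues of the bias-adjusted d x d matrix that exceed
    p_hat^2 * n * c_d, with c_d = c * log(d) as one admissible choice.
    """
    n, d = M.shape
    p_hat = y.sum() / (n * d)
    S = M.T @ M
    S_p = S - (1.0 - p_hat) * np.diag(np.diag(S))   # bias adjustment as in (6)
    evals = np.linalg.eigvalsh(S_p)                 # eigenvalues of a symmetric matrix
    threshold = p_hat ** 2 * n * c * np.log(d)
    return int(np.sum(evals > threshold))
```

For a fully observed, noiseless rank-2 matrix the trailing eigenvalues are numerically zero, so the function returns 2 for any positive `c` small enough not to exceed the signal eigenvalues.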















5 Numerical experiments 


5.1 Simulations 

This section studies the performance of the proposed estimators using several 
values of the dimension n and the probability p. 

To simulate $M_0$, generate $A \in [-5, 5]^{n \times 2}$ and $B \in [-5, 5]^{d \times 2}$ to contain i.i.d. 
Uniform[-5, 5] random variables and define 

$$M_0 = A B^T \in \mathbb{R}^{n \times d}.$$

Each entry of $M_0$ is observed with probability $p$ and unobserved with probability 
$1 - p$. The observed entries of $M_0$ are corrupted by noise $\epsilon$ as defined in Section 
2, where the $\epsilon_{kh}$ are i.i.d. $\mathcal{N}(0, 1)$. The dimension $n$ varies from 100 to 1000 and $p$ 
from 0.1 to 1, while $d = 2\sqrt{n}$. Each simulation was repeated 500 times and the 
errors were averaged. 
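
One replicate of this design could be generated as in the sketch below. The function name `simulate` is hypothetical, and rounding $2\sqrt{n}$ to the nearest integer is an assumption, since the text does not say how non-integer values of $d$ are handled.

```python
import numpy as np

def simulate(n, p, sigma=1.0, rng=None):
    """One replicate of the simulation design (hypothetical helper name).

    Returns the rank-2 signal M0, the zero-filled observed matrix M,
    and the 0/1 observation mask y.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = int(round(2 * np.sqrt(n)))                  # assumption: round 2*sqrt(n)
    A = rng.uniform(-5, 5, size=(n, 2))
    B = rng.uniform(-5, 5, size=(d, 2))
    M0 = A @ B.T                                    # n x d, rank 2
    eps = rng.normal(0.0, sigma, size=(n, d))       # i.i.d. N(0, sigma^2) noise
    y = (rng.random((n, d)) < p).astype(float)      # i.i.d. Bernoulli(p) mask
    M = y * (M0 + eps)                              # zeros at unobserved entries
    return M0, M, y

M0, M, y = simulate(100, 0.5, rng=np.random.default_rng(0))
print(M0.shape, y.mean())
```

Passing a seeded `numpy.random.Generator` makes a replicate reproducible; averaging errors over repeated calls with different seeds mirrors the 500 replicates described above.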

Figures 1 and 2 summarize the resulting mean squared errors calculated by 
$\frac{1}{nd}\|\hat M(\hat s) - M_0\|_F^2$, $\|\mathrm{diag}(\hat\lambda_1, \hat\lambda_2) - \Lambda\|_F^2$, $\|\hat V - V\|_F^2$, and $\|\hat U - U\|_F^2$, when $n$ and 
$p$ increase along the x-axis, respectively. The MSE for $\hat V$ decreases more rapidly 
than the MSE for $\hat U$ and both MSEs decrease when $p$ increases; this is consistent 
with the results in Theorem 1. The MSE of $\hat M$ decreases as $n$ and $p$ increase. 
The MSE of $\hat\Lambda$ stays stable as $n$ changes since it is measured on 
$\lambda_i$ instead of $\lambda_i^2$ (see Theorem 3), but decreases as $p$ increases. 

We further studied the asymptotic normality of $\sum_{i=1}^{2} \hat\lambda_i$ in Theorem 3.
Figure 3 graphs the QQ plot of $\sum_{i=1}^{2} \hat\lambda_i - \sum_{i=1}^{2} \lambda_i$, where the dimension $n$ is fixed 
at 1000 and $p$ varies from 0.1 to 1. This shows that the asymptotic normality 
holds across various values of $p$. 


5.2 A data example 

To illustrate the proposed estimation methods, this section analyzes the
MovieLens 100k data (GroupLens (2015)). The data set consists of 100,000 ratings 
from 943 users and 1682 movies and each user has rated at least 20 movies. 
Taking this partially observed data matrix as $M$, we computed $\hat\Sigma_{\hat p}$ as in (6) and 
plotted the scree plot of the singular values of $\hat\Sigma_{\hat p}$ to determine $\hat r$. Figure 4 shows 
the result. Since there exists a singular value gap between the 3rd and 4th
singular values, we chose $\hat r = 3$. Then, we computed the estimators of the singular 
vectors and values and the estimator of the full low rank matrix as illustrated in 
Algorithms 1 and 2. 

Figure 1: The mean squared errors for six different values of $p$ when $n$ increases. Each 
point on the plots corresponds to an average over 500 replicates. (Panels include the MSE of $\hat M$ and the MSE of $\hat\Lambda$.) 

Figure 2: The same mean squared errors as the ones in Figure 1 plotted for four different 
values of $n$ when $p$ increases. Each point on the plots corresponds to an average over 500 
replicates. 

Figure 3: Asymptotic normality of $\sum_{i=1}^{2} \hat\lambda_i - \sum_{i=1}^{2} \lambda_i$ as $p$ varies from 0.1 to 1. Across 
the plots, we fixed $n$ to be 1000. (QQ plots against theoretical quantiles; panels include $p = 0.7$, $p = 0.9$, and $p = 1$.) 

Figure 4: The singular values of $\hat\Sigma_{\hat p}$ computed by taking the MovieLens 100k data matrix 
as $M$. From this scree plot, we choose $\hat r$ to be 3. 

The estimated singular vectors help us understand what the main factors of 
movie preferences are. Table 1 lists the movies that characterize the top 3 
singular vectors (factors of movie preferences). In particular, it presents the 5 movies 
that correspond to the largest values in each singular vector and the 5 movies that 
correspond to the smallest values. The 1st factor has well-known and top-rated 
movies on one side and unknown and poorly-rated movies on the other side. The 
2nd factor has box-office hit movies of the 1990s on one side and memorable classic 
movies of the 1940s-1960s on the other side. The 3rd factor has action and thriller 
movies on one side and quieter drama movies on the other side. 

The estimated singular values help us see how influential the main factors 
of movie preferences are. In particular, Figure 5 shows the estimated singular 
values and their 95% confidence intervals. For the standard deviation used in 
the confidence intervals, we used $\Gamma_{ii}^{1/2}$ from (12) in Theorem 3. Computing 
$\Gamma_{ii}$ requires the values of the parameters $M_0$, $U$, $V$, $\lambda_i$, $p$, and 
$\sigma^2$, but we replaced these with the estimated values $\hat M(\hat s)$, $\hat U$, $\hat V$, $\hat\lambda_i$, $\hat p$, and $\hat\tau_{\hat p}/(n\hat p^2)$. 
From Figure 5, we observe that all 3 factors of movie preferences are significant. 


Table 1: Lists of movies that characterize each of the top 3 singular vectors 

1st singular vector 
  One side (well-known, top-rated): Silence of the Lambs, Fargo, Star Wars, 
    Return of the Jedi, Raiders of the Lost Ark 
  The other side (unknown, poorly-rated): A Further Gesture, Mat i syn, 
    A Very Natural Thing, Hush, Office Killer 

2nd singular vector 
  One side (box-office hit in 90’s): Scream, Air Force One, The Rock, 
    Contact, Liar Liar 
  The other side (classic in 40’s-60’s): Citizen Kane, The Graduate, Casablanca, 
    The African Queen, Dr. Strangelove 

3rd singular vector 
  One side (action, thriller): Jurassic Park, Top Gun, Speed, True Lies, 
    Batman 
  The other side (drama): Il Postino, Secrets & Lies, English Patient, 
    Full Monty, L.A. Confidential 


To find the RMSE of our estimator of the full low rank matrix, $\hat{M}(s)$, we
used the 5 training and 5 test data sets from the 5-fold cross validation split that is publicly
provided in GroupLens (2015). The RMSE was computed by




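$$\mathrm{RMSE} = \frac{\left\| \mathcal{P}_{\Omega_{\mathrm{test}}}\big(\hat{M}(s) - M\big) \right\|_F}{\sqrt{|\Omega_{\mathrm{test}}|}},$$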


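where $\Omega_{\mathrm{test}}$ contains the indices of the observed entries that belong to the test set, $\mathcal{P}_{\Omega_{\mathrm{test}}}(A)$
for a matrix $A \in \mathbb{R}^{n \times d}$ denotes the projection of $A$ onto $\Omega_{\mathrm{test}}$, and $|\Omega_{\mathrm{test}}|$ denotes
the cardinality of $\Omega_{\mathrm{test}}$. The average of the resulting RMSEs was 1.656.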
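
The following is a minimal sketch of a test-set RMSE, assuming it is the usual root mean squared error over the held-out entries: the Frobenius norm of the error restricted to the test indices, divided by the square root of their number. The function `rmse_on_test_set` and the small matrices are hypothetical.

```python
import numpy as np

def rmse_on_test_set(estimate, ratings, test_mask):
    """Root mean squared error over the entries flagged in test_mask."""
    diff = (estimate - ratings)[test_mask]  # keep only the test entries
    return float(np.sqrt(np.sum(diff ** 2) / test_mask.sum()))

# Toy example (made-up numbers): two of the four entries are in the test set.
ratings = np.array([[5.0, 3.0], [4.0, 1.0]])
estimate = np.array([[4.0, 3.0], [4.0, 3.0]])
test_mask = np.array([[True, False], [False, True]])
rmse = rmse_on_test_set(estimate, ratings, test_mask)
```

With 5-fold cross validation, the same computation would be repeated on each of the 5 test sets and the results averaged.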
















Figure 5: The 3 estimated singular values and their 95% confidence intervals.


6 Proofs 


6.1 Proofs for Theorem 1

The proofs of the following proposition and lemmas are in Appendix A.2.

Proposition 2. Under the model setup in Section 2 and Assumption 1, we have
for large $n$ and $d$,
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sin(P p ( m ),pM) 
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Ci n 


-l 


v (*& - K 


and 


m+l) 


(16) 
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sin 
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C 2 d 


-l 


F P( b m~ b m+ 1) 2 ’ 

where $V_p$ and $U_p$ are as defined above, and $C_1$ and $C_2$ are generic constants free of
$n$, $d$, and $p$.




















Lemma 3. Under the model setup in Section 2 and Assumption 1, for any given
$\mu_1 > 0$, there exists a large constant $C_{\mu_1} > 0$ such that


1 

nd 



< C Pl max • 


log n 


P~ 


P 


3/2 


logn 


n 


(17) 


with probability at least 1 — O (n Ml ), where T, p is defined in Q. Similarly, for 
any given p 2 > 0, there exists a large constant C p 2 > 0 such that 


1 

nd 


S pt ~ IE 


< C P2 max 


P~ 


logn 


P 


,3/2 


logn 


with probability at least 1 — O (n M2 ), where Y, p t is defined in Q. 

Lemma 4. Under the model setup in Section 2 and Assumption 1, for any given
$\nu_1 > 0$, there exists a large constant $C_{\nu_1} > 0$ such that


1 

nd 



< C n 


logn 1 
nd d 


(18) 


with probability at least 1 — O (n V1 ), where Y, p and are defined in © and 
Q, respectively. Similarly, for any given v 2 > 0 ; there exists a large constant 
C v 2 > 0 such that 


1 

nd 



<c nP w 


logn 1 
nd n 


with probability at least 1 — O (n l " 2 ), where T> pt and Yi pt are defined in ([6]) and 
Q, respectively. 


Lemma 5. Under the model setup in Section 2 and Assumption 1, we have for
large $n$ and $d$,
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nd 



< C 1 max 


p 3 ( 1 — p) p 2 ( 1 — p) 1 

nd 3 ’ n 2 d 5 / 2 J 


and 
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nd 
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P 3 (l ~P) 
dn 3 


P 2 ifi-P) \ 
d 2 n 5 / 2 | ’ 


(19) 


where T> p and T, p t are defined in ([6]), Ti p , Ti p t, V p , and U p are defined in Q, and 
C\ and C 2 are generic constants free ofn,d, and p. 










































Proof of Theorem 1. We only prove (10) because the other result can be proved similarly.
By the triangle inequality and Proposition 2, we have

E||sin ( V^ m \v (m) )\\ 2 F 

< 4E||sin f p + 4E||sin (yM, yM) \\ 2 p 

<4E||sin(V-W,vW)||^+ Cn 'f (20) 

Now, consider E||sin (p( m ), Vp m ' > ) ||^. Let 

Ei = < max —-|A;. — XT I < ti 1 , 

\i<i<d nd ] Pl Pl1 J 

where h = C[ p 1 ^ + C'fp 3 / 2 ^^, and 

E ' 2 = {^ A *W “ A ^+J < t2 ] ■ 

where t 2 = C 2 p 3 y/ 2 g• Then, by Weyl’s theorem (Li 
and Lemma [4j we have for large constants C [, C '[, and 62 , 


(1998a)), Lemma 


P(£?) < P ( — 
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- EE ? 


P (Ef ) < p (A 

Thus, for large n and d, 


Ep — Sp 


> tij = O (n 4 ) and 

2 >t 2 ) =0 (rT 4 ) . 


E||sin(p( m ),Pp( m ))||^ 

= E|||sin(t>M,^))||^ 1(Ein£2) 

+E{||sin(FM,^M)||^i £i n E2 
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m { E( 1 B c) + E(l Bf ) + E< —-L--- 


< cn 4 + E < 




^Ep s 

Tp (m) 

2 

f! 


(2nd P'Pra 'W+lL 

2 ) 


< cn 4 + 


C{l~p) f 1 1 1 

(bl - bl +1 ) 2 “ \ pnd 3 ’ p2„2 d 5/2 J’ 


( 21 ) 


where $1_E$ is the indicator function of an event $E$, the first inequality holds by the
fact that $\|\sin(\hat{V}^{(m)}, V_p^{(m)})\|_F^2 \le m$ and the Davis-Kahan $\sin\theta$ theorem (Theorem 3.1 in
Li (1998b)), and the last inequality is due to Lemma 5.

By (20) and (21), the result (10) follows. □
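
The bound $\|\sin(\cdot,\cdot)\|_F^2 \le m$ used in the proof can be checked numerically. The sketch below assumes the standard definition of the sine distance through principal angles: the cosines are the singular values of $V^T \hat{V}$, so the squared Frobenius norm of the sines is the sum of $1 - \cos^2$ over the $m$ angles. The function name and the random test matrices are hypothetical.

```python
import numpy as np

def sin_theta_frobenius_sq(V_hat, V):
    """Squared Frobenius norm of the sines of the principal angles between
    the column spaces of V_hat and V (both with orthonormal columns)."""
    cosines = np.linalg.svd(V.T @ V_hat, compute_uv=False)
    cosines = np.clip(cosines, 0.0, 1.0)  # guard against rounding above 1
    return float(np.sum(1.0 - cosines ** 2))

rng = np.random.default_rng(0)
d, m = 6, 2
V, _ = np.linalg.qr(rng.standard_normal((d, m)))                 # reference basis
V_hat, _ = np.linalg.qr(V + 0.1 * rng.standard_normal((d, m)))   # perturbed basis
dist_sq = sin_theta_frobenius_sq(V_hat, V)
dist_same = sin_theta_frobenius_sq(V, V)
dist_swapped = sin_theta_frobenius_sq(V[:, ::-1], V)  # same subspace, columns swapped
```

The distance depends only on the two subspaces, which is why reordering the columns of a basis leaves it at zero.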


6.2 Proofs for Theorem 2

The proofs of the following propositions are in Appendix A.3.


Proposition 3. Under the assumptions in Theorem 2, we have

|A& ££.(*?+ ^ 2 )p/ V&El^ + n^) 

— > Af (0, 1-2 ) in distribution, as n , d —> oo, 

where A pi, X t , and p are defined in Q, and respectively, and L n< i = 
Tl d e M 2x2 consists of 

Ml n n d Cm } 2 a 2 m 

(r nd )n = xiEE b ‘ u ^ +4 fiT, b l 

P k= 1 h= 1 l 2=1 ) V i =1 

/ v o / \ 2 

/ m \ " / m \ " 


(r n <i)l 2 = 2 p 2 (l -p) ’ and ( T nd) 22 =p 5 (l-p)\^ 2 b ij ■ 

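Proposition 4. Under the model setup in Section 2 and Assumption 1, let

$$\hat{\tau}_p = \frac{1}{d-r}\,\mathrm{tr}\big(\hat{V}_c^T \hat{\Sigma}_p \hat{V}_c\big),$$

where $\hat{\Sigma}_p$ and $\hat{V}_c$ are as defined earlier. Then, we have $\hat{\tau}_p - np^2\sigma^2 = O_p(p\sqrt{n})$.
Proof of Theorem 2. We have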






























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First, consider the term (a). We have 

m 

(a) = 4 tr (y {m)T %V {m) ) - E [A 2 + na 2 ] 

P i=l 

m 

l)T E p t> (m) ) - E [A 2 + na 2 

Z— 1 

+ j A tr (v {m)T ^pV {m) ) - ^2 tr (v r(m)T S p V r M) | 

+ j A tr (^ (m)T Spi> (m) ) - ^tr (v'M T E # V r ( m >) | 

= (i) + (ii) + (in). (22) 

By Q , there is O € V m , m such that 

A (m) _ V {m) 0 f F < 2 ||sin(t>^, V^)\\ 2 f and OjV^ T t p V^Oi = A 2 ,, 


where Oi is the i-th column of O. Then, the term (i) is 
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where the last equality is due to (§, ©, and ( fihj] ) below; by the application of 
Weyl’s theorem (Li (1998aj)) and Lemma [3j we can show 
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where the second inequality is due to Holder’s inequality and the last equality 
holds by the fact that 
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where the third equality is due to (26), p4]), Proposition [3j and the fact that 
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6.3 Proofs for Theorem 0] 

The proof of the following proposition is in Appendix A.4.


Proposition 5. Under the model setup in Section [£| Assumption [7J and As¬ 
sumption's), we have 

M{s 0 )-M 0 || f = -^O p (n), 

where $\hat{M}(s_0)$ is defined in (13) and (15) and $M_0$ is defined in (1).

Proof of Theorem [/]. For any given r) > 0, we have for a large n, 
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where the second inequality holds due to Assumption [2^1) and Proposition [5j □ 



























A Appendix 

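A.1 Proofs for Lemma 1

Proof of Lemma 1. Let

$$M_y = \big[(y_{kh} - p)\,M_{0kh}\big]_{1 \le k \le n,\ 1 \le h \le d} \quad \text{and} \quad \epsilon_y = \big[y_{kh}\,\epsilon_{kh}\big]_{1 \le k \le n,\ 1 \le h \le d},$$

both in $\mathbb{R}^{n \times d}$. Then, $M = pM_0 + M_y + \epsilon_y$ and

$$\hat{\Sigma} = p^2 M_0^T M_0 + M_y^T M_y + \epsilon_y^T \epsilon_y + pM_0^T M_y + pM_y^T M_0 + pM_0^T \epsilon_y + p\,\epsilon_y^T M_0 + M_y^T \epsilon_y + \epsilon_y^T M_y. \quad (A.1)$$

The result for $\hat{\Sigma}$ follows since, under the model setup in Section 2,

$$E M_y = 0, \quad E \epsilon_y = 0, \quad E(M_y^T M_y) = p(1-p)\,\mathrm{diag}(M_0^T M_0), \quad E(\epsilon_y^T \epsilon_y) = np\sigma^2 I_d,$$

$$E(M_0^T M_y) = 0, \quad E(M_0^T \epsilon_y) = 0, \quad \text{and} \quad E(M_y^T \epsilon_y) = 0.$$

We can similarly show the result (3). □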
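
The identity $E(M_y^T M_y) = p(1-p)\,\mathrm{diag}(M_0^T M_0)$ can be verified exactly on a tiny example by enumerating every observation pattern. This is only a sanity check under the assumption that the indicators $y_{kh}$ are independent Bernoulli($p$) variables; the matrix `M0` and the value of `p` below are made up.

```python
import itertools
import numpy as np

p = 0.3
M0 = np.array([[1.0, -2.0], [0.5, 3.0]])  # hypothetical 2 x 2 signal matrix
n, d = M0.shape

expected = np.zeros((d, d))
for bits in itertools.product([0, 1], repeat=n * d):
    y = np.array(bits, dtype=float).reshape(n, d)  # one observation pattern
    prob = np.prod(np.where(y == 1, p, 1 - p))     # its probability
    My = (y - p) * M0                              # centered, masked signal
    expected += prob * (My.T @ My)

target = p * (1 - p) * np.diag(np.diag(M0.T @ M0))
```

The off-diagonal entries vanish because distinct columns involve independent, centered indicators; only the diagonal, with variance $p(1-p)$ per entry, survives.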


A.2 Proofs for Section 6.1


Proof of Proposition 2. We only show the result (16), since the other result can
be shown similarly.

Let 
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where t = Cip^p _|_ C 2 P 3 / 2 ■ Note that fj -H> (1. By Weyl’s theorem (Li 
(1998a)) and Lemma [3j we have for large constants C\. C -2 > 0, 
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where 1# is an indicator function of an event E. the first inequality is due to the 
fact that || sin(yl m i, Vp m ^)\\ 2 F < m and Davis-Kahan sin# theorem (Theorem 3.1 
Li (1998b)), and the last inequality holds by Lemma [6] below. 
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Lemma 6. Under the model setup in Section [1] and Assumption [7J we have for 
large n and d, 


and 
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where E p and are defined in © and C\ and C 2 are generic constants free of 
n, d, and p. 
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Proof of Lemma [h| We only show the result (A.3) because the other result holds 
similarly. 

From (A.l), Q, Proposition [IJ and triangle inequality, we have 

(E p -EEp)v (m ) 

[My My - (1 - p)diag(MyMy) - p 2 ( 1 - p) diag(M(fM 0 )] 

[ e y e v ~ (1 -p)diag(ej’ey) - np 2 a 2 I d ]V {m) 

[My Mo - (1 -p)diag(Mp T M 0 )]P (m) 

[M^My - (1 -p)diag(M 0 T Mp)]F (m ) 

[tyM 0 - (1 -p)diag(epM 0 )]P (m) 

[Ml e y ~ C 1 - p)diag(M 0 T e y )] F (m) 

[Mjep - (1 -p)diag(Mje y )]y (m) 


+ 


[e^M y - (1 - p)diag(e^M y )] V {m) 


= (A) + (B)+p (C) + p (D) + p (E) + p (F) + (G) + (H). 

We examine the convergence rates of the above terms, ( A)-(H ). 
First, consider the term (A) in (A.4). Then, we have 

E 


(A.4) 


i= 1 j=1 v k=l h=l 


d m ( n d 


[My M y - (1 - p)diag(My My) - p 2 { 1 - p) diag(M 0 T M 0 )] 
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EE] EE p 2 E (fe-p) 2 -p(!-p)) Mvtvfi i (h=i) 

+ E ((yw - p) 2 (yfch - p) 2 ) M oliM 0 2 kh v 2 h i (h?4i) 
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d m ( n d - 

EE EE ,Jn “ 2 “- 4 " 2 


2=1 J = 1 V fc=l h=l L 


d m n d 


< P 2 (i - p)£ 4 EEEE y j 

2=1 J=1 fc=l /l=l 
d m n 

= p 2 (i-p) l 4 EEEi 

i=l j=l fc=l 

< Cp 2 (l — p)nd. 


2 


(A.5) 


Similarly to (A.5), we can show that the expected values of the squares of the terms (B), (D), (F), (G),
and (H) are bounded by $Cp^2nd$.

Second, consider the term (C) in (A.4). Then, we have
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[MjMo - (1 -p)diag(M y T M 0 )]F (m) 


d m 


EE e E(^-p) e 
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P M 0 2 ki Vji t(h=i) 
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d m n ( d 

EEE ^(Vki ~ P ) 2 l ^ Mo ki MokhVjh [l — (1 — p)l(h=«) 

2=1 j =1 fc=l V /l=l 

d m n ( d ^ ^ 

p(i -p) EEE E MokiMQ k hVjh [l — (1 — p)l(/ l =i) 

2=1 j=l k= 1 l fo=l 


d m n ( d \ * 

<p(i-p)L^f2^2\ Em | 

*=i j=i fc=i i h=i j 
< C*p(l — p)nd 2 , 


(A.6) 


where the last inequality holds due to the Cauchy-Schwarz inequality.


Lastly, for the term (E) in (A.4), 
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^ 2 EEE E M QkhVjh [l — (1 — P)^(h=i) 
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<P^L 2 EEE j Efel I 

i =1 7=1 fc=l l h= 1 J 


< Cpnd 2 , 

(A.7)

where the last inequality holds due to the Cauchy-Schwarz inequality.

The result follows from (A.5)-(A.7). □


Lemma 7. Under the model setup in Section 2 and Assumption 1, we have for
any given $\xi_1 > 0$,

$$\|M_y\|_2 \le C_{\xi_1}\sqrt{pn\log n}$$

with probability $1 - O(n^{-\xi_1})$. Similarly, we have for any given $\xi_2 > 0$,

$$\|\epsilon_y\|_2 \le C_{\xi_2}\sqrt{pn\log n}$$

with probability $1 - O(n^{-\xi_2})$.
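
A rough numerical illustration of the scale in Lemma 7 is sketched below. It simulates one centered, masked matrix $M_y$ and compares its spectral norm with $\sqrt{pn\log n}$. The dimensions, the sampling probability, the bound `L`, and the uniform distribution for the entries of $M_0$ are all hypothetical choices; a single draw only illustrates the order of magnitude and proves nothing.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, p, L = 400, 300, 0.2, 1.0
M0 = rng.uniform(-L, L, size=(n, d))          # bounded entries, |M0_kh| <= L
y = (rng.random((n, d)) < p).astype(float)    # independent Bernoulli(p) indicators
My = (y - p) * M0                             # centered, masked signal

spec = np.linalg.norm(My, 2)                  # spectral norm (largest singular value)
frob = np.linalg.norm(My, "fro")              # Frobenius norm, always at least spec
ratio = spec / np.sqrt(p * n * np.log(n))     # compare with the rate in the lemma
```

The trivial bound through the Frobenius norm grows like $\sqrt{nd}$, so the rate $\sqrt{pn\log n}$ in the lemma is much sharper when $d$ is large.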
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Proof of Lemma\7\ Let My )J> E M. nxd be such that 
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(i,j) _ j (Vkh ~ p)M 0kh , (k, h) = ( i,j) 


Vkh 


o, 


(k,h) ^ (■ i,j) 


for 1 < k < n and 1 < h < d. 


Then, 


n d 


nd^ y nd 


^EE^ 


*= 1 3 =1 


E (My’^) = 0, and ||M ^*’^|| 2 < L for all 1 < k < n and 1 < h < d. Also, we have 


-^E 
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p{l-p) 


nd 

p{l-p) 


nd 


diag (MqMq) 
diag (Mq Mo) 


< P -P and 


n 


< 


pL 2 


(A. 8 ) 


Thus, by Proposition 1 in Koltchinskii et al. (2011a), we have 

< C 


nd My 


< Cm ax 


pL 2 /logn ^log?r 


d V nd 


nd 


p logn 
nd 2 


with probability at least 1 — n ^. 


In a similar way, together with Proposition 2 in Koltchinskii et al. (2011a), we
can show that $\|\epsilon_y\|_2 \le C\sqrt{pn\log n}$ with probability at least $1 - n^{-\xi_2}$.

□


Proof of Lemma 3. We only show the result (17) because the other result holds
similarly.

From (A.1), Proposition 1, and the triangle inequality, we have

j Lip — EEp 

nd 2 


< ^ \\ M y My-(1- p)di&g(My My) -P 2 {1- p) diag(M 0 r M 0 )\\ 

+ nd ^ e y Cy ~~ ( X ~ P) dia g( e y e y) ~ np 2 (J 2 I d \\ 2 
+2 ^ \\pMy Mq - (1 - p)pdiag(My M 0 )|| 2 
+ 2 ^ ||pepM) - (1 — p)pdiag(Cy M 0 )|| 2 

+ 2 ^ - (1 -p)diag(Mjep)|| 2 

= (/) + (//) + 2 (I//) + 2 (IV) + 2 (V). 


(A.9) 
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Because of similarity, we provide arguments only for (I) and (IV). 
Consider the term (I) in (A.9). First, we have by Lemma[7] 


A IKm„|| 2 = nd 


nd My 


< Cp 


logn 
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(A.10) 


with probability at least 1 — 0(n w ). Also, we have with probability at least 

1 — 0(n -/il ), 
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< 


nd 
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diag (M^M y ) - p( 1 - p) diag (M%M 0 )\\ 
p(l-p) 
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H--— max 

nd l<h<d 


< C 
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Okh 
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-l 


n d 


(A-11) 


where the second inequality holds by (A.12) below. Take t 2 = c l M^p(\ — p)(3p 2 — 
3 p + 1) for some large constant c > 0. Then, by Bernstein’s inequality, 

[(Vkh ~ P? ~ p( 1 - p)\ M 0 2 kh 
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l<h<d 


d 

sE' 

h= 1 
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2L‘ i p(l — p)(3p 2 — 3p + 1) 


(A.12) 
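
The factor $p(1-p)(3p^2 - 3p + 1)$ in the choice of $t^2$ coincides with the fourth central moment of a Bernoulli($p$) indicator, which bounds the variance of $(y_{kh} - p)^2 - p(1-p)$. The following exact check with rational arithmetic is a small sketch of that arithmetic; the helper name and the value $p = 3/10$ are hypothetical.

```python
from fractions import Fraction

def bernoulli_central_moment(p, order):
    """Exact E[(y - p)^order] for y ~ Bernoulli(p)."""
    return p * (1 - p) ** order + (1 - p) * (0 - p) ** order

p = Fraction(3, 10)
fourth = bernoulli_central_moment(p, 4)
closed_form = p * (1 - p) * (3 * p ** 2 - 3 * p + 1)

# Variance of (y - p)^2 - p(1 - p): fourth moment minus the squared variance.
variance_of_square = fourth - (p * (1 - p)) ** 2
```

The variance itself equals $p(1-p)(1-2p)^2$, which is never larger than the fourth central moment.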


By (A. 10) and (A. 11), we have 


(/) < Cp 


log n 


(A.13) 


with probability at least 1 — 0(n w ). Similarly, we can show that (II) and (V) 
are bounded by with probability at least 1 — 0(n -/il ). 
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Consider the term (IV) in (A.9). We have 
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(l V f < max ■£ 
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(IV) 

kij 
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1< i<d 
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fc=i 


(/V) 

kij 


where ndxj^p = pyki^kiMokj^-(i^j) + P 2 yki^kiM 0kj t( i= j) and hence ’ are 
centered sub-Gaussian random variables under the model setup in Section [2j 
Then, we have for any p E M and for all 1 < k < n, 1 < z < d, and 1 <j< d. 

{ 2 p 3 /3 1 

—> for some constant (3 > 0. 

Take t 2 = for some large constant c > 0 and p = . Then, by 




Markov’s inequality, 
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Similarly, 
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l<i<d 


E 
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E x . 

k=l 


(IV) 

kij 


> t < Cn~ 


(A.14) 


(A.15) 


By (A.14) and (A.15), with probability at least 1 — O (n w ), 

\(IV)\<Cp 3 / 2 ^^. 


(A.16) 


Similarly, we can show that (III) is bounded by Cp 3 / 2 ywith probability 
at least 1 — 0(n _Ml ). 

The statement follows from (A.13) and (A.16). □
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Proof of Lemma^ We only show the result (18) because the other result holds 
similarly. 

By triangle inequality, we have 


1 

nd 

< 


Tip T p 


\P~P\ 

nd 


1 

nd 


{.P ~ P) diag(S) 


| ||diag(E) - diag(pM(f M 0 + npa 2 I d ) ||, 


+ \\diag(pM^M 0 + npa 2 I d )\\ 2 y (A.17) 


We will look at the terms in (A.17) one by one. 

By Bernstein’s inequality, we have for large constant C > 0, 
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( n d 

k=1 h=l 

< 2 exp{— v\ logn} 
= 2 n~ Ul . 


p( 1 — p) logn 
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> Cy/p(l — p)nd\ogn 


(A.18) 


Take t 2 = c pl °^ 2 n for some large constant c > 0. Then, since + 

£fci) 2 — p(M, o|j + cr 2 ),/c = l,...,n, are independent centered sub-exponential 
random variables, we have by Proposition 5.16 in Vershynin (2010), 
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Also, note that 
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(A.19) 
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(A.20) 
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Combining the results in (A.17)-{A.20), we have 


Yip — Y p 


2 < cA 2 / 

with probability at least 1 — 0 {n ~ vi ). 


1 

nd 


logn 1 
nd d 


(A.21) 

□ 


Proof of Lemma 5. We only show the result (19) because the other result holds
similarly.

We have
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< m E 
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< m E < (p — pY 
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(A.22) 


n 2 d 5 / 2 ' z nd 3 

where the fourth inequality holds by Hölder's inequality and the fifth inequality
is due to the fact that
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O (p 2 ( 1 — p) 2 n 2 d 2 ) O ( p 2 n 2 d ) 
n 8 d 8 


(A.23) 
□ 


A.3 Proofs for Section 6.2

Lemma 8. Under the model setup in Section 2 and Assumption 1, we have
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and (i) + (rz) = 0 P (^Jp 8 nd)j, where \ p i and A* are defined in Q and (JTj) . 
respectively. 

Proof of Lemma [5| We have 
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= (a) + (6) + (c) + (d) + (e) + (/), (A.24) 

where O € V mim is a solution to infg e y m m ||Vp m ^ — V^Q\\ F and the fourth 
equality holds by © and (|A.1[). Below, we examine the six terms (a)-(f) one by 


one. 


The term (a) in (A.24) is
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Note that the two terms in (A.25) are centered and uncorrelated with each other. 
So, the variance is 
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(A.26) 


where the first inequality is due to Jensen's inequality. This shows that the term
(a) is $O_p(p\sqrt{n})$. Similarly, we can show that the terms (b) and (e) are $O_p(p\sqrt{n})$.
The term (c) in (A.24) is
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The term (d) in (A.24) is 
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The term (f) in (A.24) is 
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(A.29) 

where is the i-th column of O and the last equality holds by Proposition [2j 
Q, and (25). 

Therefore, the result follows from (A.24)-(A.29). □ 
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where the second equality holds by Lemma [ 8 j Since the terms (a) and ( 6 ) are 
centered and not correlated with each other under the model setup in Section [2j 
we have 
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where the third equality is due to (A.27), (A.28) and Assumption [lj 1 ). Note 
that 
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Thus, Liapunov’s condition is satisfied with (a) + ( b ) because we have 
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where the first inequality holds by Assumption [lj 1 ), and the last two lines 
due to Cauchy-Schwarz inequality. 

By (A.30)-(A.33), Liapunov CLT and Slutsky theorem, we have 
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Proof of Proposition Similarly to the proof of (A.24), we have 
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where 0 G Vd- r ,d-r is a solution to infg e v d _ rd _J[17c — T7Q|| F , and the third 
equality is due to the fact that MqV c = UAV T V C = 0. We will show that (A)-(F) 
are O p (p\/n). 

Since the first five terms, (A)-(E), are centered, we only need to check their
variances to find their rates. The variances of the terms (A), (B), and (E) are
$O(p^2 n)$, which can be shown similarly to the proof of (A.26). The variance of
the term (C) is

d—r 


var (C) < 


d — r 


(diag (m£m 0 ))V ci 


1 


d — r 
1 

d — r 


i —1 
d—r 

E 


var 


n d 


- p) M olhVc 


2 

cih 


i= 1 L fc=l h=l 
d—r 


E 


L 4 ^2°(p(l-p)) 


i =1 L k =1 


= 0(pn), 


where the inequality is due to Jensen's inequality. Similarly, the variance of the
term (D) is $O(pn)$.


Now, consider the term (F). Similarly to the proof of (A.29), 
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where Oi is the i-th column of O, the third inequality can be derived similarly 
to the proof of (24), and the last equality holds by Proposition [2] and (25). □ 

A.4 Proofs for Section 6.3

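Proof of Proposition 5. Let $\Delta_{\lambda_i} = \hat{\lambda}_i - \lambda_i$, $\Delta_{U_i} = \mathrm{sign}(\langle \hat{U}_i, U_i \rangle)\hat{U}_i - U_i$, and
$\Delta_{V_i} = \mathrm{sign}(\langle \hat{V}_i, V_i \rangle)\hat{V}_i - V_i$ for all $i \in \{1, \ldots, r\}$. Similarly to the proof of Theorem
2, we can show that for all $i = 1, \ldots, r$,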
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where the third equality holds due to (A.34) and Theorem [I] 
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A.5 Proofs for Lemma [2] 




Proof of Lemma 2. By Weyl's theorem (Li (1998a)), Lemma 3, and Lemma 4,
for any given $\delta > 0$, there exists a large constant $C_\delta > 0$ such that
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with probability at least 1 — 0(n 5 ). Also, by definition of f, we have 


{r = rj = {Xj r >p 2 nC d ,Xj r+1 <p 2 nC d } 

= | [Xp r - p 2 {\ 2 r + na 2 )] +p 2 (\ 2 r + na 2 ) > p 2 nC d , 

[A| r+1 - p 2 no 2 ] +p 2 na 2 <p 2 nC d ^ 1 ( A.36) 

where X 2 = b 2 nd by Assumption [lj 1). The result follows by (A.35) and (A.36). 
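
Weyl's theorem, used here and in the earlier proofs, states that each ordered eigenvalue of a symmetric matrix moves by at most the spectral norm of a symmetric perturbation. The sketch below checks this on random matrices; the dimension, the seed, and the perturbation scale are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
A = rng.standard_normal((d, d))
A = (A + A.T) / 2                         # symmetric base matrix
E = 0.05 * rng.standard_normal((d, d))
E = (E + E.T) / 2                         # symmetric perturbation

eig_A = np.linalg.eigvalsh(A)             # eigenvalues in ascending order
eig_AE = np.linalg.eigvalsh(A + E)
max_shift = np.max(np.abs(eig_AE - eig_A))
bound = np.linalg.norm(E, 2)              # spectral norm of the perturbation
```

This is why a spectral-norm bound on the estimation error of a matrix translates directly into bounds on every estimated eigenvalue, which in turn controls a thresholding rule for the rank.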
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